Assignment 2 CSCN8000 Artificial Intelligence Algorithms and Mathematics

Download the heart disease dataset (heart.csv, in the Exercise folder) from https://www.kaggle.com/fedesoriano/heart-failure-prediction and complete the following tasks.

In [40]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
plotly.offline.init_notebook_mode()
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from scipy.stats import zscore
from scipy import stats
from sklearn.metrics import accuracy_score, classification_report
  1. Load the heart disease dataset into a pandas DataFrame.
In [41]:
# Load the dataset
data = pd.read_csv('heart.csv')

# Display the first few rows of the DataFrame
data.head()
Out[41]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0
  2. Remove outliers using the mean, median, and Z-score.
In [42]:
data.isnull().sum()
Out[42]:
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64
In [43]:
# Checking for outliers
df_num = data.select_dtypes(include=['float64', 'int64'])

# Calculate the number of rows and columns for subplots
num_rows = (len(df_num.columns) - 1) // 3 + 1
num_cols = min(3, len(df_num.columns))

# Create subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 4 * num_rows))
fig.suptitle("Distribution of Numeric Data and Outliers", y=1.02, fontsize=16)

# Flatten axes so indexing works for any grid shape
axes = np.ravel(axes)

# Plot boxplots for each numeric column
for i, col in enumerate(df_num.columns):
    ax = axes[i]
    sns.boxplot(data=df_num[col], ax=ax, color='green')
    ax.set_title(f"Distribution of {col}", fontsize=12)
    ax.set_xlabel("")
    ax.set_ylabel(col, fontsize=10)
    ax.tick_params(axis='both', labelsize=8)
    sns.despine()

# Remove any empty subplots
for i in range(len(df_num.columns), num_rows * num_cols):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()
In [44]:
def remove_outliers_zscore(data, threshold=3):
    # Z-score method: drop rows where any numeric column lies more than
    # `threshold` standard deviations from its column mean
    z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))
    return data[(z_scores < threshold).all(axis=1)]

def remove_outliers_mean_std(data, threshold=3):
    # Mean/std method: keep only values within mean ± threshold * std,
    # filtering one numeric column at a time
    for column in data.select_dtypes(include=[np.number]).columns:
        mean = data[column].mean()
        std = data[column].std()
        data = data[(data[column] >= mean - threshold * std) & (data[column] <= mean + threshold * std)]
    return data

# Load the dataset
data = pd.read_csv('heart.csv')

# Removing outliers using mean and std deviation
data_no_outliers_mean_std = remove_outliers_mean_std(data)

# Removing outliers using z-scores
data_no_outliers_zscore = remove_outliers_zscore(data)

print("Shape after removing outliers using mean and std deviation:", data_no_outliers_mean_std.shape)
print("Shape after removing outliers using z-scores:", data_no_outliers_zscore.shape)
Shape after removing outliers using mean and std deviation: (899, 12)
Shape after removing outliers using z-scores: (899, 12)
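The task also mentions a median-based approach, which the two functions above do not cover. A common median-centered technique is the IQR rule (keep values within [Q1 − k·IQR, Q3 + k·IQR]); the sketch below shows it on a small synthetic frame, since it is an assumption that this is the intended median method. On the real data it would be called as `remove_outliers_iqr(data)`.

```python
import numpy as np
import pandas as pd

def remove_outliers_iqr(df, k=1.5):
    """Median/IQR rule: for every numeric column, keep rows whose
    value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.number]).columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out = out[(out[col] >= q1 - k * iqr) & (out[col] <= q3 + k * iqr)]
    return out

# Tiny synthetic frame with one obvious outlier (900)
demo = pd.DataFrame({"Cholesterol": [200, 210, 190, 205, 900]})
print(remove_outliers_iqr(demo).shape)  # (4, 1) -- the 900 row is dropped
```

The IQR rule is more robust than the mean/std methods above because the median and quartiles are not pulled toward extreme values, so a single large outlier cannot widen its own acceptance band.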
  3. Convert text columns to numbers using label encoding and one-hot encoding.
In [45]:
# Create a copy of the DataFrame for encoding
df_encoded = data.copy()

# Define columns for label encoding
label_encode_cols = ['Sex', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

# Define columns for one-hot encoding
onehot_encode_cols = ['ChestPainType']

# Apply label encoding to selected columns
for col in label_encode_cols:
    df_encoded[col] = df_encoded[col].astype('category').cat.codes

# Apply one-hot encoding to selected columns
df_encoded = pd.get_dummies(df_encoded, columns=onehot_encode_cols, prefix=onehot_encode_cols)

# Display the encoded DataFrame
df_encoded.head()
Out[45]:
Age Sex RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease ChestPainType_ASY ChestPainType_ATA ChestPainType_NAP ChestPainType_TA
0 40 1 140 289 0 1 172 0 0.0 2 0 False True False False
1 49 0 160 180 0 1 156 0 1.0 1 1 False False True False
2 37 1 130 283 0 2 98 0 0.0 2 0 False True False False
3 48 0 138 214 0 1 108 1 1.5 1 1 True False False False
4 54 1 150 195 0 1 122 0 0.0 2 0 False False True False
  4. Apply scaling.
In [46]:
# Separate numerical columns for scaling
numerical_cols = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']

# Create a copy of the DataFrame for scaling
df_scaled = data.copy()

# Apply standard scaling to numerical columns
scaler = StandardScaler()
df_scaled[numerical_cols] = scaler.fit_transform(df_scaled[numerical_cols])

# Display the scaled DataFrame
df_scaled.head()
Out[46]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 -1.433140 M ATA 0.410909 0.825070 -0.551341 Normal 1.382928 N -0.832432 Up 0
1 -0.478484 F NAP 1.491752 -0.171961 -0.551341 Normal 0.754157 N 0.105664 Flat 1
2 -1.751359 M ATA -0.129513 0.770188 -0.551341 ST -1.525138 N -0.832432 Up 0
3 -0.584556 F ASY 0.302825 0.139040 -0.551341 Normal -1.132156 Y 0.574711 Flat 1
4 0.051881 M NAP 0.951331 -0.034755 -0.551341 Normal -0.581981 N -0.832432 Up 0
  5. Build a classification model using a support vector machine (SVM).

Demonstrate the standalone model as well as a Bagging model, and include observations about their performance.

In [47]:
# Convert categorical variables to numerical using one-hot encoding
df_encoded = pd.get_dummies(data, columns=['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'])

# Separate features and target variable
X = df_encoded.drop('HeartDisease', axis=1)
y = df_encoded['HeartDisease']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standalone SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print("Standalone SVM Accuracy:", accuracy_svm)

# Bagging SVM model ('estimator' replaces the deprecated 'base_estimator' in scikit-learn >= 1.2)
bagging_model = BaggingClassifier(estimator=SVC(kernel='linear'), n_estimators=10, random_state=42)
bagging_model.fit(X_train, y_train)
y_pred_bagging = bagging_model.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print("Bagging SVM Accuracy:", accuracy_bagging)
Standalone SVM Accuracy: 0.8532608695652174
Bagging SVM Accuracy: 0.8532608695652174
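`classification_report` is imported at the top but never used; beyond a single accuracy number it reports per-class precision, recall, and F1. A minimal sketch on hypothetical labels (standing in for `y_test` and `y_pred_svm` from the cell above):

```python
from sklearn.metrics import classification_report

# Hypothetical labels standing in for y_test / y_pred_svm
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# Prints per-class precision, recall, F1, and support
print(classification_report(y_true, y_pred, target_names=["No disease", "Disease"]))
```

On the notebook's data, `classification_report(y_test, y_pred_svm)` would show whether the 0.853 accuracy is balanced across the two classes or dominated by one of them.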
  6. Now use a decision tree classifier. Use a standalone model as well as Bagging, and check whether you notice any difference in performance.
In [48]:
# Convert categorical columns to one-hot encoded features
data_encoded = pd.get_dummies(data, columns=['Sex', 'ChestPainType', 'FastingBS', 'RestingECG', 'ExerciseAngina', 'ST_Slope'])

# Split the data into features (X) and target (y)
X = data_encoded.drop('HeartDisease', axis=1)
y = data_encoded['HeartDisease']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standalone Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
y_pred_dt = dt_classifier.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print("Standalone Decision Tree Accuracy:", accuracy_dt)

# Bagging with Decision Tree Classifier
bagging_dt = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                               n_estimators=10, random_state=42)
bagging_dt.fit(X_train, y_train)
y_pred_bagging = bagging_dt.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print("Bagging Decision Tree Accuracy:", accuracy_bagging)
Standalone Decision Tree Accuracy: 0.8043478260869565
Bagging Decision Tree Accuracy: 0.8586956521739131

Observations:

Support Vector Machine (SVM):

The standalone SVM achieved an accuracy of 0.853, and the bagged SVM matched it exactly, so bagging did not improve the SVM's performance here. This is expected: a linear SVM is a low-variance model, and bagging primarily reduces variance.

Decision Tree Classifier:

The standalone Decision Tree reached 0.804, while the bagged ensemble reached 0.859, a notable improvement. Averaging the predictions of multiple trees, each trained on a bootstrap sample, reduces the overfitting and variance that individual deep trees are prone to.

In summary, bagging left the SVM's accuracy essentially unchanged but clearly improved the Decision Tree. This matches the typical behavior of bagging, which benefits high-variance models the most. The decision to use bagging should be based on the characteristics of the model and the dataset, as well as the desired trade-off between computational cost and performance gain.